penguins dataset
The penguins dataset comes from Allison Horst’s palmerpenguins R package and records biometric measurements of penguin species observed at Palmer Station, Antarctica [@gormanEcologicalSexualDimorphism2014]. It is an alternative to the classic iris dataset for exploratory data analysis, avoiding the association with Ronald Fisher, who published the iris data in 1936 and “held strong views on race and eugenics”. Use of either dataset is acceptable for submission of this lab (and mention of iris or Fisher will be dropped next year).
# load R packages
librarian::shelf(cluster, dplyr, DT, factoextra, ggplot2, ggvoronoi, scales, h2o, palmerpenguins, tibble, vegan, vegan3d)
# set seed for reproducible results
set.seed(42)
# load the dataset
data("penguins")
# look at documentation in RStudio
if (interactive())
help(penguins)
# show data table
datatable(penguins)
# skim the table for a summary
#skim(penguins)
# remove the rows with NAs
penguins <- na.omit(penguins)
# plot length vs depth, species naive
ggplot(penguins, aes(bill_length_mm, bill_depth_mm)) +
geom_point()
# plot length vs depth, color by species
legend_pos <- theme(
legend.position = c(0.95, 0.05),
legend.justification = c("right", "bottom"),
legend.box.just = "right")
ggplot(penguins, aes(bill_length_mm, bill_depth_mm, color = species)) +
geom_point() +
legend_pos
penguins using kmeans()
# cluster using kmeans
k <- 3 # number of clusters
penguins_k <- kmeans(penguins %>%
select(bill_length_mm, bill_depth_mm),
centers = k)
# show cluster result
penguins_k
## K-means clustering with 3 clusters of sizes 136, 85, 112
##
## Cluster means:
## bill_length_mm bill_depth_mm
## 1 38.42426 18.27794
## 2 50.90353 17.33647
## 3 45.50982 15.68304
##
## Clustering vector:
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 3 1 1 1 1
## [75] 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 3 1 3 1 1 1 3 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 1 3 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 2
## [149] 3 2 3 3 3 3 3 3 3 2 3 3 3 2 3 2 3 2 2 3 3 3 3 3 3 3 2 3 3 3 2 2 2 3 3 3 2
## [186] 3 2 3 2 2 3 3 2 3 3 3 3 3 2 3 3 3 3 3 2 3 3 3 2 3 2 2 3 2 3 3 3 3 3 2 3 2
## [223] 3 3 2 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 2 3 3 2 3 2 3 2 3 3 2 3 3 2 2 3 2 3 2
## [260] 2 3 3 2 3 2 3 2 2 3 2 3 3 2 3 2 3 2 3 2 3 2 2 2 3 2 3 2 3 2 3 2 2 2 3 2 1
## [297] 2 3 2 2 3 3 2 3 2 2 3 2 3 2 2 2 2 2 2 3 2 3 2 3 2 3 2 2 3 2 3 3 2 3 2 2 2
##
## Within cluster sum of squares by cluster:
## [1] 904.9838 617.9859 742.0970
## (between_SS / total_SS = 79.8 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
# compare clusters with species (which were not used to cluster)
table(penguins_k$cluster, penguins$species)
##
## Adelie Chinstrap Gentoo
## 1 135 1 0
## 2 0 40 45
## 3 11 27 74
Bonus Question: How many observations could be considered “misclassified” if expecting bill length and bill depth to differentiate between species?
84 observations
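The 84 can be derived from the confusion table above: assign each species to the cluster holding most of its members and count the rest. A base-R sketch, with the table values copied from the output above:

```r
# confusion table: rows = kmeans clusters 1-3, columns = species
cm <- matrix(c(135, 0, 11,    # Adelie counts in clusters 1-3
               1, 40, 27,     # Chinstrap counts in clusters 1-3
               0, 45, 74),    # Gentoo counts in clusters 1-3
             nrow = 3,
             dimnames = list(cluster = 1:3,
                             species = c("Adelie", "Chinstrap", "Gentoo")))

# per species, observations outside its modal cluster count as "misclassified"
misclassified <- sum(colSums(cm) - apply(cm, 2, max))
misclassified  # 84
```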
# extract cluster assignment per observation
Cluster = factor(penguins_k$cluster)
ggplot(penguins, aes(bill_length_mm, bill_depth_mm, color = Cluster)) +
geom_point() +
legend_pos
Question 1: Comparing the observed species plot with 3 species with the kmeans() cluster plot with 3 clusters, where does this “unsupervised” kmeans() technique (that does not use species to “fit” the model) produce similar versus different results? One or two sentences would suffice. Feel free to mention ranges of values along the axes.
Answer
kmeans() partitioned the points more vertically (i.e., by x-axis value) than the actual species distribution does: the majority of the Chinstrap points were split between clusters 1 and 2. From this plot, it seems that kmeans() weights the x-axis (bill length) values more heavily.
penguins
This form of clustering assigns each point to the cluster with the nearest centroid. You can see the breaks more clearly with a Voronoi diagram.
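The nearest-centroid rule that each Voronoi cell encodes can be written out directly; a minimal base-R sketch with toy points and centroids (not the penguin values):

```r
# toy 2-D points and k = 2 centroids
pts  <- rbind(c(1, 1), c(1.2, 0.8), c(5, 5), c(5.5, 4.8))
ctrs <- rbind(c(1, 1), c(5, 5))

# assign each point to the centroid minimizing squared Euclidean distance
nearest <- apply(pts, 1, function(p) {
  which.min(colSums((t(ctrs) - p)^2))
})
nearest  # 1 1 2 2
```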
# define bounding box for geom_voronoi()
xr <- extendrange(range(penguins$bill_length_mm), f=0.1)
yr <- extendrange(range(penguins$bill_depth_mm), f=0.1)
box <- tribble(
~bill_length_mm, ~bill_depth_mm, ~group,
xr[1], yr[1], 1,
xr[1], yr[2], 1,
xr[2], yr[2], 1,
xr[2], yr[1], 1,
xr[1], yr[1], 1) %>%
data.frame()
# cluster using kmeans
k <- 3 # number of clusters
penguins_k <- kmeans(penguins %>%
select(bill_length_mm, bill_depth_mm),
centers = k)
# extract cluster assignment per observation
Cluster = factor(penguins_k$cluster)
# extract cluster centers
ctrs <- as.data.frame(penguins_k$centers) %>%
mutate(Cluster = factor(1:k))
# plot points with voronoi diagram showing nearest centroid
ggplot(penguins, aes(bill_length_mm, bill_depth_mm, color = Cluster)) +
geom_point() +
legend_pos +
geom_voronoi(data = ctrs, aes(fill=Cluster), color = NA, alpha=0.5,
outline = box) +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0)) +
geom_point(data = ctrs, pch=23, cex=2, fill="black")
BONUS Task: Show the Voronoi diagram for fewer (k=2) and more (k=8) clusters to see how assignment to cluster centroids works.
Answer For k=2
# cluster using kmeans
k2 <- 2 # number of clusters
penguins_k2 <- kmeans(penguins %>%
select(bill_length_mm, bill_depth_mm),
centers = k2)
# extract cluster assignment per observation
Cluster2 = factor(penguins_k2$cluster)
# extract cluster centers
ctrs2 <- as.data.frame(penguins_k2$centers) %>%
mutate(Cluster2 = factor(1:k2))
# plot points with voronoi diagram showing nearest centroid
ggplot(penguins, aes(bill_length_mm, bill_depth_mm, color = Cluster2)) +
geom_point() +
legend_pos +
geom_voronoi(data = ctrs2, aes(fill=Cluster2), color = NA, alpha=0.5, outline = box) +
geom_point(data = ctrs2, pch=23, cex=2, fill="black") +
labs(title = "Voronoi diagram for 2 clusters")
Answer For k=8
# cluster using kmeans
k8 <- 8 # number of clusters
penguins_k8 <- kmeans(penguins %>%
select(bill_length_mm, bill_depth_mm),
centers = k8)
# extract cluster assignment per observation
Cluster8 = factor(penguins_k8$cluster)
# extract cluster centers
ctrs8 <- as.data.frame(penguins_k8$centers) %>%
mutate(Cluster8 = factor(1:k8))
# plot points with voronoi diagram showing nearest centroid
ggplot(penguins, aes(bill_length_mm, bill_depth_mm, color = Cluster8)) +
geom_point() +
legend_pos +
geom_voronoi(data = ctrs8, aes(fill=Cluster8), color = NA, alpha=0.5, outline = box) +
geom_point(data = ctrs8, pch=23, cex=2, fill="black") +
labs(title = "Voronoi diagram for 8 clusters")
Next, cluster sites according to species composition. Use the dune dataset from the vegan R package.
dune dataset
# load dune dataset from package vegan
data("dune")
# show documentation on dataset if interactive
if (interactive())
help(dune)
Bonus Question: What are the rows and columns composed of in the dune data frame?
Answer
Rows are the 20 sites and columns are the 30 plant species; each cell holds the cover class value of that species at that site.
sites
Before we calculate ecological distance between sites for dune, let’s look at these metrics with a simpler dataset, like the example given in Chapter 8 by @kindtTreeDiversityAnalysis2005.
sites <- tribble(
~site, ~sp1, ~sp2, ~sp3,
"A", 1, 1, 0,
"B", 5, 5, 0,
"C", 0, 0, 1) %>%
column_to_rownames("site")
sites
## sp1 sp2 sp3
## A 1 1 0
## B 5 5 0
## C 0 0 1
sites_manhattan <- vegdist(sites, method="manhattan")
sites_manhattan
## A B
## B 8
## C 3 11
sites_euclidean <- vegdist(sites, method="euclidean")
sites_euclidean
## A B
## B 5.656854
## C 1.732051 7.141428
sites_bray <- vegdist(sites, method="bray")
sites_bray
## A B
## B 0.6666667
## C 1.0000000 1.0000000
Question 7: In your own words, how does Bray Curtis differ from Euclidean distance?
Answer
The Bray Curtis distance differs from Euclidean in several ways:
- Bray-Curtis values range from 0 to 1, where 0 is total similarity and 1 is total dissimilarity. Euclidean distance also shows similarity by values closer to 0 but has no maximum of 1.
- Bray-Curtis gives less weight to outliers than Euclidean distance.
- The distance is weighted by the species shared between sites rather than the total number of species at a site; Euclidean distance is driven by raw differences in species abundance.
- Bray-Curtis is not metric, unlike Euclidean, and therefore cannot be used directly in ordination methods that assume Euclidean geometry, such as canonical correspondence analysis (CCA) or principal components analysis (PCA).
sites
Let’s take a closer look at the Bray-Curtis Dissimilarity distance:
\[ B_{ij} = 1 - \frac{2C_{ij}}{S_i + S_j} \]
\(B_{ij}\): Bray-Curtis dissimilarity value between sites \(i\) and \(j\).
1 = completely dissimilar (no shared species); 0 = identical.
\(C_{ij}\): sum of the lesser counts \(C\) for shared species common to both sites \(i\) and \(j\)
\(S_i\), \(S_j\): sum of all species counts \(S\) at site \(i\) or \(j\), respectively
So to calculate Bray-Curtis for the example sites:
\(B_{AB} = 1 - \frac{2 * (1 + 1)}{2 + 10} = 1 - 4/12 = 1 - 1/3 = 0.667\)
\(B_{AC} = 1 - \frac{2 * 0}{2 + 1} = 1\)
\(B_{BC} = 1 - \frac{2 * 0}{10 + 1} = 1\)
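Those hand calculations can be checked in base R with a direct transcription of the formula (no vegan required):

```r
# example sites from above: rows A, B, C; columns sp1-sp3
sites <- rbind(A = c(1, 1, 0),
               B = c(5, 5, 0),
               C = c(0, 0, 1))

# Bray-Curtis: 1 - 2 * (sum of lesser shared counts) / (S_i + S_j)
bray <- function(i, j) 1 - 2 * sum(pmin(i, j)) / (sum(i) + sum(j))

round(c(AB = bray(sites["A", ], sites["B", ]),
        AC = bray(sites["A", ], sites["C", ]),
        BC = bray(sites["B", ], sites["C", ])), 4)
# AB = 0.6667, AC = 1, BC = 1 -- matching vegdist(sites, method = "bray")
```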
dune
# Dissimilarity matrix
d <- vegdist(dune, method="bray")
dim(d)
## NULL
as.matrix(d)[1:5, 1:5]
## 1 2 3 4 5
## 1 0.0000000 0.4666667 0.4482759 0.5238095 0.6393443
## 2 0.4666667 0.0000000 0.3414634 0.3563218 0.4117647
## 3 0.4482759 0.3414634 0.0000000 0.2705882 0.4698795
## 4 0.5238095 0.3563218 0.2705882 0.0000000 0.5000000
## 5 0.6393443 0.4117647 0.4698795 0.5000000 0.0000000
# Hierarchical clustering using Complete Linkage
hc1 <- hclust(d, method = "complete" )
# Dendrogram plot of hc1
plot(hc1, cex = 0.6, hang = -1)
Question 9: Which function comes first, vegdist() or hclust(), and why?
Answer
In order to perform agglomerative hierarchical clustering with hclust(), you first compute the dissimilarity values, using vegdist() here, to create a distance matrix. The distance matrix is a triangular matrix with all pairwise distances between sites. This matrix is then fed into hclust() to create a cluster dendrogram. The distance values determine the shape of the clusters, so they have to be calculated first.
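The same order holds with base R’s dist() in place of vegdist(); a toy sketch (not the dune data) showing the distance matrix as the required input to hclust():

```r
# three 1-D observations; dist() must come first
x  <- c(0, 1, 10)
d  <- dist(x)                      # pairwise distances: 1, 10, 9
hc <- hclust(d, method = "complete")
hc$merge   # step 1 joins obs 1 & 2 (the closest pair), step 2 adds obs 3
hc$height  # merge heights: 1, then 10 (complete linkage = maximum distance)
```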
# Compute agglomerative clustering with agnes
hc2 <- agnes(dune, method = "complete")
# Agglomerative coefficient
hc2$ac
## [1] 0.5398129
# Dendrogram plot of hc2
plot(hc2, which.plot = 2)
# methods to assess
m <- c( "average", "single", "complete", "ward")
names(m) <- c( "average", "single", "complete", "ward")
# function to compute coefficient
ac <- function(x) {
agnes(dune, method = x)$ac
}
# get agglomerative coefficient for each linkage method
purrr::map_dbl(m, ac)
## average single complete ward
## 0.4067153 0.2007896 0.5398129 0.6939994
# Compute ward linkage clustering with agnes
hc3 <- agnes(dune, method = "ward")
# Agglomerative coefficient
hc3$ac
## [1] 0.6939994
# Dendrogram plot of hc3
plot(hc3, which.plot = 2)
Question 11: In your own words how does hclust() differ from agnes()?
Answer
It seems like hclust() has broader cluster categories than agnes(). This may be because agnes() contains six agglomerative algorithms, some of which are not included in hclust(). The agnes() method also provides more cluster heights than hclust().
Question 13: Of the 4 methods, which is the “best” model in terms of Agglomerative Coefficient?
Answer
Ward’s method is the “best” model, with the highest agglomerative coefficient (0.694) of the four linkage methods.
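The choice can be made programmatically with which.max(); a minimal sketch using the coefficients printed by map_dbl() above (hard-coded here):

```r
# agglomerative coefficients from the map_dbl() output above
ac_values <- c(average  = 0.4067153,
               single   = 0.2007896,
               complete = 0.5398129,
               ward     = 0.6939994)

# linkage method with the strongest clustering structure
names(which.max(ac_values))  # "ward"
```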
dune
See text to accompany code: HOMLR 21.3.2 Divisive hierarchical clustering.
# compute divisive hierarchical clustering
hc4 <- diana(dune)
# Divise coefficient; amount of clustering structure found
hc4$dc
## [1] 0.5287677
Question 15: In your own words how does agnes() differ from diana()?
Answer
AGglomerative NESting (agnes()) is the bottom-up clustering approach, working from the leaves to the root of the dendrogram tree. DIvisive ANAlysis (diana()) is the top-down approach, from root to leaves. DIANA can be more accurate at the top levels because it considers the entire distribution of the data when making the first partitions. AGNES is good at detecting small clusters, while DIANA is effective at detecting large clusters.
# Plot cluster results
p1 <- fviz_nbclust(dune, FUN = hcut, method = "wss", k.max = 10) +
ggtitle("(A) Elbow method")
p2 <- fviz_nbclust(dune, FUN = hcut, method = "silhouette", k.max = 10) +
ggtitle("(B) Silhouette method")
p3 <- fviz_nbclust(dune, FUN = hcut, method = "gap_stat", k.max = 10) +
ggtitle("(C) Gap statistic")
# Display plots side by side
gridExtra::grid.arrange(p1, p2, p3, nrow = 1)
Question 17: How do the optimal number of clusters compare between methods for those with a dashed line?
Answer
The Gap statistic suggests the optimal number of clusters is 3, whereas the Silhouette method shows 4 clusters is best. The Elbow method gives no definitive number of clusters, as the curve has not clearly leveled off.
# Construct dendrogram for the dune example
hc5 <- hclust(d, method = "ward.D2" )
dend_plot <- fviz_dend(hc5)
dend_data <- attr(dend_plot, "dendrogram")
dend_cuts <- cut(dend_data, h = 8)
fviz_dend(dend_cuts$lower[[2]])
# Ward's method
hc5 <- hclust(d, method = "ward.D2" )
# Cut tree into 4 groups
k = 4
sub_grp <- cutree(hc5, k = k)
# Number of members in each cluster
table(sub_grp)
## sub_grp
## 1 2 3 4
## 8 6 4 2
# Plot full dendrogram
fviz_dend(
hc5,
k = k,
horiz = TRUE,
rect = TRUE,
rect_fill = TRUE,
rect_border = "jco",
k_colors = "jco")
Question 18: In dendrogram plots, which is the biggest determinant of relatedness between observations: the distance between observations along the labeled axes or the height of their shared connection?
Answer
The height of their shared connection is the biggest determinant of relatedness.
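This can be demonstrated with cophenetic(), which reads pairwise relatedness straight off the merge heights; a small base-R sketch on toy 1-D data (not the dune sites):

```r
# obs 1 & 2 sit close together, as do obs 3 & 4; the two groups are far apart
x  <- c(0, 1, 9, 10)
hc <- hclust(dist(x), method = "complete")

# cophenetic distance = height of the lowest shared connection
cophenetic(hc)
# within each pair the distance is 1 (merged early);
# across the groups it is 10 (merged last), regardless of axis position
```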
# get data
url <- "https://koalaverse.github.io/homlr/data/my_basket.csv"
my_basket <- readr::read_csv(url)
dim(my_basket)
## [1] 2000 42
my_basket
## # A tibble: 2,000 × 42
## `7up` lasagna pepsi yop red.wine cheese bbq bulmers mayonnaise horlics
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 2 1 0 0 0 0 0
## 5 0 0 0 0 0 0 0 2 0 0
## 6 0 0 0 0 0 0 0 0 0 0
## 7 1 1 0 0 0 0 1 0 0 0
## 8 0 0 0 0 0 0 0 0 0 0
## 9 0 1 0 0 0 0 0 0 0 0
## 10 0 0 0 0 0 0 0 0 0 2
## # … with 1,990 more rows, and 32 more variables: chicken.tikka <dbl>,
## # milk <dbl>, mars <dbl>, coke <dbl>, lottery <dbl>, bread <dbl>,
## # pizza <dbl>, sunny.delight <dbl>, ham <dbl>, lettuce <dbl>,
## # kronenbourg <dbl>, leeks <dbl>, fanta <dbl>, tea <dbl>, whiskey <dbl>,
## # peas <dbl>, newspaper <dbl>, muesli <dbl>, white.wine <dbl>, carrots <dbl>,
## # spinach <dbl>, pate <dbl>, instant.coffee <dbl>, twix <dbl>,
## # potatoes <dbl>, fosters <dbl>, soup <dbl>, toad.in.hole <dbl>, …
h2o.no_progress() # turn off progress bars for brevity
h2o.init(max_mem_size = "5g") # connect to H2O instance
## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 10 days 16 hours
## H2O cluster timezone: Etc/UTC
## H2O data parsing timezone: UTC
## H2O cluster version: 3.36.0.2
## H2O cluster version age: 12 days
## H2O cluster name: H2O_started_from_R_palomacartwright_qea893
## H2O cluster total nodes: 1
## H2O cluster total memory: 4.95 GB
## H2O cluster total cores: 64
## H2O cluster allowed cores: 64
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## H2O API Extensions: Amazon S3, XGBoost, Algos, Infogram, AutoML, Core V3, TargetEncoder, Core V4
## R Version: R version 4.0.5 (2021-03-31)
# convert data to h2o object
my_basket.h2o <- as.h2o(my_basket)
# run PCA
my_pca <- h2o.prcomp(
training_frame = my_basket.h2o,
pca_method = "GramSVD",
k = ncol(my_basket.h2o),
transform = "STANDARDIZE",
impute_missing = TRUE,
max_runtime_secs = 1000)
my_pca
## Model Details:
## ==============
##
## H2ODimReductionModel: pca
## Model ID: PCA_model_R_1643260067821_190
## Importance of components:
## pc1 pc2 pc3 pc4 pc5 pc6
## Standard deviation 1.513919 1.473768 1.459114 1.440635 1.435279 1.411544
## Proportion of Variance 0.054570 0.051714 0.050691 0.049415 0.049048 0.047439
## Cumulative Proportion 0.054570 0.106284 0.156975 0.206390 0.255438 0.302878
## pc7 pc8 pc9 pc10 pc11 pc12
## Standard deviation 1.253307 1.026387 1.010238 1.007253 0.988724 0.985320
## Proportion of Variance 0.037400 0.025083 0.024300 0.024156 0.023276 0.023116
## Cumulative Proportion 0.340277 0.365360 0.389659 0.413816 0.437091 0.460207
## pc13 pc14 pc15 pc16 pc17 pc18
## Standard deviation 0.970453 0.964303 0.951610 0.947978 0.944826 0.932943
## Proportion of Variance 0.022423 0.022140 0.021561 0.021397 0.021255 0.020723
## Cumulative Proportion 0.482630 0.504770 0.526331 0.547728 0.568982 0.589706
## pc19 pc20 pc21 pc22 pc23 pc24
## Standard deviation 0.931745 0.924207 0.917106 0.908494 0.903247 0.898109
## Proportion of Variance 0.020670 0.020337 0.020026 0.019651 0.019425 0.019205
## Cumulative Proportion 0.610376 0.630713 0.650739 0.670390 0.689815 0.709020
## pc25 pc26 pc27 pc28 pc29 pc30
## Standard deviation 0.894277 0.876167 0.871809 0.865912 0.855036 0.845130
## Proportion of Variance 0.019041 0.018278 0.018096 0.017852 0.017407 0.017006
## Cumulative Proportion 0.728061 0.746339 0.764436 0.782288 0.799695 0.816701
## pc31 pc32 pc33 pc34 pc35 pc36
## Standard deviation 0.842818 0.837655 0.826422 0.818532 0.813796 0.804380
## Proportion of Variance 0.016913 0.016706 0.016261 0.015952 0.015768 0.015405
## Cumulative Proportion 0.833614 0.850320 0.866581 0.882534 0.898302 0.913707
## pc37 pc38 pc39 pc40 pc41 pc42
## Standard deviation 0.796073 0.793781 0.780615 0.778612 0.763433 0.749696
## Proportion of Variance 0.015089 0.015002 0.014509 0.014434 0.013877 0.013382
## Cumulative Proportion 0.928796 0.943798 0.958307 0.972741 0.986618 1.000000
##
##
## H2ODimReductionMetrics: pca
##
## No model metrics available for PCA
Question 20: Why is the pca_method of “GramSVD” chosen over “GLRM”?
Answer
Because the my_basket data contains primarily numeric data. When there is primarily categorical data, using the GLRM method is recommended.
Question 21: How many initial principal components are chosen with respect to dimensions of the input data?
Answer
Forty-two principal components were chosen.
my_pca@model$eigenvectors %>%
as.data.frame() %>%
mutate(feature = row.names(.)) %>%
ggplot(aes(pc1, reorder(feature, pc1))) +
geom_point()
Question 23: What category of grocery items contribute most to PC1? (These are related because they’re bought most often together on a given grocery trip)
Answer
The category of grocery items that most contribute to PC1 is adult beverages.
my_pca@model$eigenvectors %>%
as.data.frame() %>%
mutate(feature = row.names(.)) %>%
ggplot(aes(pc1, pc2, label = feature)) +
geom_text()
Question 25: What category of grocery items contribute the least to PC1 but positively towards PC2?
Answer
The category of grocery items that contribute the least to PC1 but positively towards PC2 is breakfast items (milk, coffee, tea, muesli, etc.).
# Compute eigenvalues
eigen <- my_pca@model$importance["Standard deviation", ] %>%
as.vector() %>%
.^2
# Sum of all eigenvalues equals number of variables
sum(eigen)
## [1] 42
# Find PCs whose eigenvalue is greater than or equal to 1
which(eigen >= 1)
## [1] 1 2 3 4 5 6 7 8 9 10
# Extract PVE and CVE
ve <- data.frame(
PC = my_pca@model$importance %>% seq_along(),
PVE = my_pca@model$importance %>% .[2,] %>% unlist(),
CVE = my_pca@model$importance %>% .[3,] %>% unlist())
# Plot PVE and CVE
ve %>%
tidyr::gather(metric, variance_explained, -PC) %>%
ggplot(aes(PC, variance_explained)) +
geom_point() +
facet_wrap(~ metric, ncol = 1, scales = "free")
# How many PCs required to explain at least 90% of total variability
min(which(ve$CVE >= 0.9))
## [1] 36
Question 27: How many principal components would you include to explain 90% of the total variance?
Answer
36 principal components would need to be included to explain 90% of the total variance.
# Scree plot criterion
data.frame(
PC = my_pca@model$importance %>% seq_along,
PVE = my_pca@model$importance %>% .[2,] %>% unlist()) %>%
ggplot(aes(PC, PVE, group = 1, label = PC)) +
geom_point() +
geom_line() +
geom_text(nudge_y = -.002)
Question 29: How many principal components to include up to the elbow of the PVE, i.e. the “elbow” before plateau of dimensions explaining the least variance?
Answer
Eight (8) principal components would need to be included.
Question 30: What are a couple of disadvantages to using PCA? See HOMLR 17.6 Final thoughts.
Answer
According to Boehmke & Greenwell, disadvantages of traditional PCA include:
- influence of outliers
- poor performance with complex nonlinear patterns
# vegetation and environment in lichen pastures from Vare et al (1995)
data("varespec") # species
data("varechem") # chemistry
varespec %>% tibble()
## # A tibble: 24 × 44
## Callvulg Empenigr Rhodtome Vaccmyrt Vaccviti Pinusylv Descflex Betupube
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.55 11.1 0 0 17.8 0.07 0 0
## 2 0.67 0.17 0 0.35 12.1 0.12 0 0
## 3 0.1 1.55 0 0 13.5 0.25 0 0
## 4 0 15.1 2.42 5.92 16.0 0 3.7 0
## 5 0 12.7 0 0 23.7 0.03 0 0
## 6 0 8.92 0 2.42 10.3 0.12 0.02 0
## 7 4.73 5.12 1.55 6.05 12.4 0.1 0.78 0.02
## 8 4.47 7.33 0 2.15 4.33 0.1 0 0
## 9 0 1.63 0.35 18.3 7.13 0.05 0.4 0
## 10 24.1 1.9 0.07 0.22 5.3 0.12 0 0
## # … with 14 more rows, and 36 more variables: Vacculig <dbl>, Diphcomp <dbl>,
## # Dicrsp <dbl>, Dicrfusc <dbl>, Dicrpoly <dbl>, Hylosple <dbl>,
## # Pleuschr <dbl>, Polypili <dbl>, Polyjuni <dbl>, Polycomm <dbl>,
## # Pohlnuta <dbl>, Ptilcili <dbl>, Barbhatc <dbl>, Cladarbu <dbl>,
## # Cladrang <dbl>, Cladstel <dbl>, Cladunci <dbl>, Cladcocc <dbl>,
## # Cladcorn <dbl>, Cladgrac <dbl>, Cladfimb <dbl>, Cladcris <dbl>,
## # Cladchlo <dbl>, Cladbotr <dbl>, Cladamau <dbl>, Cladsp <dbl>, …
Question 31: What are the dimensions of the varespec data frame and what do rows versus columns represent?
Answer
The varespec data frame has 24 rows and 44 columns: the columns are species (labeled by abbreviated scientific name) and the rows are sites, with each cell holding the estimated cover value of that species at that site. The varechem data frame provides soil chemistry characteristics of the same 24 sites.
vare.dis <- vegdist(varespec)
vare.mds0 <- monoMDS(vare.dis)
stressplot(vare.mds0)
Question 33: The “stress” in a stressplot represents the difference between the observed input distance versus the fitted ordination distance. How much better is the non-metric (i.e., NMDS) fit versus a linear fit (as with PCA) in terms of \(R^2\)?
Answer
The non-metric fit has \(R^2 = 0.977\); the linear fit has \(R^2 = 0.873\). The non-metric fit is better by a difference of 0.104.
ordiplot(vare.mds0, type = "t")
Question 35: What two sites are most dissimilar based on species composition for the first component MDS1? And two more most dissimilar sites for the second component MDS2?
Answer
The two sites most dissimilar based on species composition for the first component MDS1 are 28 and 4. The two sites most dissimilar based on species composition for the second component MDS2 are 21 and 5.
vare.mds <- metaMDS(varespec, trace = FALSE)
vare.mds
##
## Call:
## metaMDS(comm = varespec, trace = FALSE)
##
## global Multidimensional Scaling using monoMDS
##
## Data: wisconsin(sqrt(varespec))
## Distance: bray
##
## Dimensions: 2
## Stress: 0.1843196
## Stress type 1, weak ties
## Two convergent solutions found after 20 tries
## Scaling: centring, PC rotation, halfchange scaling
## Species: expanded scores based on 'wisconsin(sqrt(varespec))'
plot(vare.mds, type = "t")
Question 38: What is the basic difference between metaMDS() and monoMDS()?
Answer
According to Minchin (1987), NMDS is regarded as the most robust unconstrained ordination method in community ecology. metaMDS() calls monoMDS() internally to perform the actual NMDS; it runs the ordination from multiple random starts, selects the solution with the smallest stress, and adds species scores to the site ordination.
ef <- envfit(vare.mds, varechem, permu = 999)
ef
##
## ***VECTORS
##
## NMDS1 NMDS2 r2 Pr(>r)
## N -0.05039 -0.99873 0.2081 0.075 .
## P 0.68702 0.72664 0.1755 0.137
## K 0.82729 0.56178 0.1657 0.156
## Ca 0.75015 0.66127 0.2811 0.038 *
## Mg 0.69675 0.71731 0.3494 0.013 *
## S 0.27626 0.96108 0.1774 0.136
## Al -0.83776 0.54603 0.5155 0.001 ***
## Fe -0.86197 0.50695 0.4001 0.006 **
## Mn 0.80233 -0.59688 0.5322 0.002 **
## Zn 0.66518 0.74668 0.1779 0.132
## Mo -0.84876 0.52878 0.0517 0.564
## Baresoil 0.87210 -0.48933 0.2494 0.050 *
## Humdepth 0.92636 -0.37664 0.5590 0.001 ***
## pH -0.79908 0.60123 0.2624 0.042 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Permutation: free
## Number of permutations: 999
plot(vare.mds, display = "sites")
plot(ef, p.max = 0.05)
Question 40: What two soil chemistry elements have the strongest negative relationship with NMDS1 that is based on species composition?
Answer
The two soil chemistry elements that have the strongest negative relationship with NMDS1 based on species composition are Aluminum (Al) and Iron (Fe).
ef <- envfit(vare.mds ~ Al + Ca, data = varechem)
plot(vare.mds, display = "sites")
plot(ef)
tmp <- with(varechem, ordisurf(vare.mds, Al, add = TRUE))
ordisurf(vare.mds ~ Ca, data=varechem, add = TRUE, col = "green4")
##
## Family: gaussian
## Link function: identity
##
## Formula:
## y ~ s(x1, x2, k = 10, bs = "tp", fx = FALSE)
##
## Estimated degrees of freedom:
## 4.2 total = 5.2
##
## REML score: 158.4853
Question 42: Which of the two NMDS axes differentiates Ca the most, i.e. has the highest value given by the contours at the end (and not middle) of the axis? Describe in general terms (upper/lower/left/right/middle) where the highest and lowest values are found for Ca with respect to the ordination axes NMDS1 and NMDS2 (ie the ordination axes that describe the most variation of species composition between sites).
Answer:
The calcium (Ca) vector points toward the upper-right section of the plot and is positive on both the NMDS1 and NMDS2 axes. NMDS1 differentiates calcium the most. Calcium values, which vary from roughly 300 to 800 along the green contour lines, are highest in the center and upper-right sections of the plot.
See supporting text in vegantutor.pdf:
Technically, this uses another technique, cca(), or canonical correspondence analysis.
# ordinate on species constrained by three soil elements
vare.cca <- cca(varespec ~ Al + P + K, varechem)
vare.cca
## Call: cca(formula = varespec ~ Al + P + K, data = varechem)
##
## Inertia Proportion Rank
## Total 2.0832 1.0000
## Constrained 0.6441 0.3092 3
## Unconstrained 1.4391 0.6908 20
## Inertia is scaled Chi-square
##
## Eigenvalues for constrained axes:
## CCA1 CCA2 CCA3
## 0.3616 0.1700 0.1126
##
## Eigenvalues for unconstrained axes:
## CA1 CA2 CA3 CA4 CA5 CA6 CA7 CA8
## 0.3500 0.2201 0.1851 0.1551 0.1351 0.1003 0.0773 0.0537
## (Showing 8 of 20 unconstrained eigenvalues)
Question 43: What is the difference between “constrained” versus “unconstrained” ordination within ecological context?
Answer
Unconstrained ordination analyzes a single data matrix, whereas constrained ordination associates two or more data sets (Borcard 2018).
Unconstrained ordination:
- displays all variation in data
- useful for viewing the overall data pattern
Constrained ordination:
- uses prior hypothesis(es) to create plot
- relates a matrix of response variables to explanatory variables
- displays variation in the explanatory variable
- tests hypotheses and detects data trends
# plot ordination
plot(vare.cca)
Question 45: What sites are most differentiated by CCA1, i.e. furthest apart along its axis, based on species composition AND the environment? What is the strongest environmental vector for CCA1, i.e. longest environmental vector in the direction of the CCA1 axes?
Answer:
The sites most differentiated by CCA1 are 28 and 4. The strongest environmental vector for CCA1 is aluminum (Al).
# plot 3 dimensions
ordiplot3d(vare.cca, type = "h")
if (interactive()){
ordirgl(vare.cca)
}